
feat: add batch inference API to llama stack inference#1945

Merged
ashwinb merged 6 commits into main from batchinfer
Apr 12, 2025

Conversation


@ashwinb ashwinb commented Apr 11, 2025

What does this PR do?

This PR adds two methods to the Inference API:

  • batch_completion
  • batch_chat_completion

The motivation is evaluations targeting a local inference engine (like meta-reference or vllm), where batch APIs can provide a substantial speedup.

Why did I not add this to Api.batch_inference? That would have resulted in a lot more bookkeeping given the structure of Llama Stack. Had I done that, I would have needed to create a notion of a "batch model" resource, set up routing based on it, etc. This does not sound ideal.

So what's the future of the batch inference API? I am not sure. Maybe we can keep it for true asynchronous execution. So you can submit requests, and it can return a Job instance, etc.
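To make the shape of the two new methods concrete, here is a minimal, self-contained sketch. The class and function signatures below are illustrative assumptions for this note, not the actual Llama Stack API: the key idea is that batch_chat_completion takes one message list per conversation and returns one completion per conversation, in input order.

```python
from dataclasses import dataclass

# Hypothetical sketch -- names and signatures are assumptions for
# illustration, not the real Llama Stack inference interface.

@dataclass
class CompletionMessage:
    content: str

def batch_chat_completion(model_id: str, messages_batch: list) -> list:
    """Accept one list of messages per conversation and return one
    completion per conversation, preserving input order. A real engine
    would stack the conversations into a single batched forward pass;
    this stub just echoes to show the input/output shape."""
    return [
        CompletionMessage(content=f"(reply to: {messages[-1]['content']})")
        for messages in messages_batch
    ]

# One entry per conversation; results come back in the same order.
batch = [
    [{"role": "user", "content": "hello"}],
    [{"role": "user", "content": "what is the capital of France?"}],
]
responses = batch_chat_completion("meta-llama/Llama-4-Scout-17B-16E-Instruct", batch)
print(len(responses))  # 2
```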

Test Plan

Run meta-reference-gpu using:

```bash
export INFERENCE_MODEL=meta-llama/Llama-4-Scout-17B-16E-Instruct
export INFERENCE_CHECKPOINT_DIR=../checkpoints/Llama-4-Scout-17B-16E-Instruct-20250331210000
export MODEL_PARALLEL_SIZE=4
export MAX_BATCH_SIZE=32
export MAX_SEQ_LEN=6144

LLAMA_MODELS_DEBUG=1 llama stack run meta-reference-gpu
```

Then run the batch inference test case.

@facebook-github-bot added the CLA Signed label (managed by the Meta Open Source bot) on Apr 11, 2025

@ehhuang ehhuang left a comment


How much is the speed up? Just curious.


ashwinb commented Apr 11, 2025

> How much is the speed up? Just curious.

I will calculate some aggregate toks/sec values by running a bunch of examples (from evals) sequentially vs. in batch.


ashwinb commented Apr 12, 2025

Corresponding llama-stack-client changes: llamastack/llama-stack-client-python#220

ashwinb added a commit to llamastack/llama-stack-client-python that referenced this pull request Apr 12, 2025

ashwinb commented Apr 12, 2025

@ehhuang Here are some numbers for various batch sizes running for 100 samples of the BFCL benchmark:

llama-4-scout

| batch size | time |
|-----------:|------|
| 1          | 2:36 |
| 8          | 3:55 |
| 16         | 2:50 |
| 32         | 1:46 |

llama-3.3-70b

| batch size | time |
|-----------:|------|
| 1          | 3:16 |
| 8          | 4:38 |
| 16         | 3:18 |
| 32         | 2:02 |
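Converting the wall-clock times above, batch size 32 works out to only about a 1.5x end-to-end speedup over batch size 1 for llama-4-scout, and about 1.6x for llama-3.3-70b. A quick back-of-envelope check (`to_seconds` is just a helper for the m:ss format used in the tables):

```python
def to_seconds(t: str) -> int:
    """Parse an m:ss wall-clock time into total seconds."""
    m, s = t.split(":")
    return int(m) * 60 + int(s)

# Wall-clock times from the tables above (batch size 1 vs. 32).
scout_speedup = to_seconds("2:36") / to_seconds("1:46")      # llama-4-scout
llama70b_speedup = to_seconds("3:16") / to_seconds("2:02")   # llama-3.3-70b

print(round(scout_speedup, 2), round(llama70b_speedup, 2))  # 1.47 1.61
```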

My conclusion: this batch inference implementation is far from "effective" at substantially accelerating inference. However, it is a good first step. Most of the work in the PR is infrastructure, and we can now connect to vLLM's (inline) batch APIs when needed.

@ashwinb ashwinb merged commit f34f22f into main Apr 12, 2025
22 checks passed
@ashwinb ashwinb deleted the batchinfer branch April 12, 2025 18:41
MichaelClifford pushed a commit to MichaelClifford/llama-stack that referenced this pull request Apr 14, 2025
